1 Abstract

This analysis explores the effects of different chemicals in white wine on the taste and reported quality of individual wines.

2 Introduction

A study from the University of Minho detailed the chemical profiles and subjective rating of several different wines. The full profile of the data, collection methods and descriptions of objective and subjective attributes can be found here: link

What makes a good wine, and whether a wine can be considered objectively superior to others, is the subject of intense debate. Some studies suggest that the perception of a wine’s quality is based more on the container or price than the beverage itself — for a summary of some statistical analyses of wine tasting competitions, check out this article link.

But many of these analyses do not take the chemical profile of the wines into account. In this analysis, I will be exploring whether the chemically measured aspects of the individual wine samples had any meaningful effect on the wine’s rating.

3 Exploration

3.1 Univariate Exploration

## [1] 4898   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

This dataset contains 4898 observations - so almost 5000 individual wine samples - of 13 variables. Eleven of those are the chemical profiles of the wines, one is the score the wine has been given, and X is the sample number. This dataset seems very tidy, so no further cleaning is needed.
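The dimension, name and structure listings above could be produced along these lines. This is only a sketch: the file name `wineQualityWhites.csv` and the variable name `wines` are assumptions, and the fallback data frame is a stand-in built from a few of the first values printed in the structure listing, not the full dataset.

```r
# Assumed file name; the report itself does not show the path it loaded from.
csv_path <- "wineQualityWhites.csv"

if (file.exists(csv_path)) {
  wines <- read.csv(csv_path)
} else {
  # Stand-in: a few columns and rows copied from the str() listing above.
  wines <- data.frame(
    X              = 1:5,
    fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2),
    residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
    alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
    quality        = c(6L, 6L, 6L, 6L, 6L)
  )
}

dim(wines)      # number of observations and variables
names(wines)    # column names
str(wines)      # type and first values of each column
summary(wines)  # min, quartiles, mean and max per column
```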

While “quality” could have easily been an ordered factor, it was recorded as an int. For now, however, this representation serves our purpose just fine, so let’s get a summary.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

3.1.0.1 Quality

Wine samples were scored on a scale from 0 to 10. Reported scores in the dataset ranged from 3 to 9, with a median of 6 and a mean of 5.878 (wines were given whole number scores). Time for a histogram. The roughly normal distribution makes it clear that the vast majority of wine samples are rated slightly above the scale midpoint of 5, with only around 400 samples in total scoring either below 5 or above 7.
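The score counts behind such a histogram can be tabulated directly, and this is also where treating quality as an ordered factor becomes handy. The scores below are hypothetical, chosen only to illustrate the calls.

```r
# Hypothetical scores standing in for the integer `quality` column.
quality <- c(6L, 5L, 7L, 3L, 9L, 6L)

# Ordered factor covering the observed range of scores (3 to 9).
quality_f <- factor(quality, levels = 3:9, ordered = TRUE)

is.ordered(quality_f)        # the factor carries an ordering
quality_f[2] < quality_f[3]  # comparisons respect that ordering
table(quality_f)             # counts per score, empty levels included
```

Keeping empty levels in the table makes gaps in the score range visible, which a plain `table(quality)` on the integers would hide.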

Only 5 wines scored a 9. I wonder how their chemical compositions compare to the entire dataset’s.

While overlaying the averages of the top scoring wines on the histograms of the entire samplings’ data by variable would be enlightening, it might also be a little overkill for the purposes of this exploration, so let’s just compare summary ranges and means for now.
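A comparison like this can be sketched by subsetting the top scorers and summarising them next to the full data. The data frame below is a made-up stand-in with only two chemical columns, and the name `top` is an assumption.

```r
# Hypothetical stand-in for the full data frame; values are made up.
wines <- data.frame(
  alcohol   = c(9.5, 10.4, 12.5, 12.7, 11.0, 12.4),
  chlorides = c(0.045, 0.050, 0.031, 0.021, 0.044, 0.032),
  quality   = c(5L, 6L, 9L, 9L, 7L, 9L)
)

# Keep only the wines that scored a 9.
top <- subset(wines, quality == 9)

summary(top)

# How far each top-scorer mean sits from the overall mean.
colMeans(top) - colMeans(wines)
```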

##        X          fixed.acidity  volatile.acidity  citric.acid   
##  Min.   : 775.0   Min.   :6.60   Min.   :0.240    Min.   :0.290  
##  1st Qu.: 821.0   1st Qu.:6.90   1st Qu.:0.260    1st Qu.:0.340  
##  Median : 828.0   Median :7.10   Median :0.270    Median :0.360  
##  Mean   : 981.4   Mean   :7.42   Mean   :0.298    Mean   :0.386  
##  3rd Qu.: 877.0   3rd Qu.:7.40   3rd Qu.:0.360    3rd Qu.:0.450  
##  Max.   :1606.0   Max.   :9.10   Max.   :0.360    Max.   :0.490  
##  residual.sugar    chlorides      free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 1.60   Min.   :0.0180   Min.   :24.0        Min.   : 85         
##  1st Qu.: 2.00   1st Qu.:0.0210   1st Qu.:27.0        1st Qu.:113         
##  Median : 2.20   Median :0.0310   Median :28.0        Median :119         
##  Mean   : 4.12   Mean   :0.0274   Mean   :33.4        Mean   :116         
##  3rd Qu.: 4.20   3rd Qu.:0.0320   3rd Qu.:31.0        3rd Qu.:124         
##  Max.   :10.60   Max.   :0.0350   Max.   :57.0        Max.   :139         
##     density             pH          sulphates        alcohol     
##  Min.   :0.9897   Min.   :3.200   Min.   :0.360   Min.   :10.40  
##  1st Qu.:0.9898   1st Qu.:3.280   1st Qu.:0.420   1st Qu.:12.40  
##  Median :0.9903   Median :3.280   Median :0.460   Median :12.50  
##  Mean   :0.9915   Mean   :3.308   Mean   :0.466   Mean   :12.18  
##  3rd Qu.:0.9906   3rd Qu.:3.370   3rd Qu.:0.480   3rd Qu.:12.70  
##  Max.   :0.9970   Max.   :3.410   Max.   :0.610   Max.   :12.90  
##     quality 
##  Min.   :9  
##  1st Qu.:9  
##  Median :9  
##  Mean   :9  
##  3rd Qu.:9  
##  Max.   :9

3.1.0.2 Fixed Acidity

Described in the accompanying file as tartaric acid, “most acids involved with wine or fixed or nonvolatile”, and measured in g / dm^3.

The lowest value for fixed acidity was 3.8, the highest was 14.2, and the mean was 6.855.

The wines with 9s had a mean of 7.4, with quartiles at 6.9 and 7.4. These numbers are all slightly higher than the total mean – potentially significant, potentially just a sampling bias. Possible candidate for further investigation.

3.1.0.3 Volatile Acidity

This is the amount of acetic acid in each wine sample in g / dm^3. Too much acetic acid can cause wine to taste like vinegar. The minimum level for all samples was 0.08, the maximum level was 1.1, and the mean was 0.28.

The top scorers here had a mean of 0.30, with a minimum and maximum of 0.24 and 0.36 respectively. In this regard, these wines seem average.

3.1.0.4 Citric Acid

Described as adding the flavor “freshness” to a wine, the lowest amount found in the whole data set was 0.0000 g / dm^3. The most was 1.66, but the median was 0.32 (mean 0.33), with first and third quartiles clustered nearby at 0.27 and 0.39, respectively.

The minimum reported citric acid value for the top 5 was 0.29, and the maximum was 0.49. The mean was 0.39 and the quartiles were 0.34 and 0.45 - potentially skewed towards slightly higher concentrations than the total population? It might be interesting to see if exceptionally low values correspond to lower quality ratings.

3.1.0.5 Residual Sugar

While this dataset included an apparent (superior) sweet wine or two, with a max reported value of 65.8 grams/liter (wines over 45 grams/liter are considered sweet), the mean residual sugar value was 6.391 g/l, with quartiles at 1.7 and 9.9. The minimum value was 0.6.

The gold standard wines, in contrast, had a mean of 4.12 g/l, with quartiles at 2.0 and 4.2. The lowest was 1.6, and the highest seemed to be a bit of an outlier at 10.6. It is interesting to note that the top rated wines mostly fell between the first quartile and mean of the total dataset.

3.1.0.6 Chlorides

The amount of salt in the wine, measured in g / dm^3. The total dataset had a minimum of 0.009, a maximum of 0.346, and a mean of 0.04577.

The 5 best wines were all below the total average, with a mean of 0.0274. This seems like a potential candidate for a predictor of wine quality.

3.1.0.7 Free Sulfur Dioxide

Important in preventing microbial growth and oxidation, the range for the dataset’s free SO2 values was from 2 to 289. The mean was 35.31 (values were reported in whole numbers here as well), and the quartiles were nearby at 23 and 46.

For the top tier wines, the mean was 33.4, with quartiles at 27 and 31, the minimum at 24, and the max potentially outlying at 57. While free sulfur dioxide might be passively important to ensuring wine quality, by preventing negative aspects of aging such as oxidation, these values seem reasonably average.

3.1.0.8 Total Sulfur Dioxide

The total amount of both bound and free forms of SO2 in the sample, measured in mg / dm^3. Values ranged from 9 to 440, with a mean of 138.4 and quartiles at 108 and 167.

The SO2 average for the best wines was, in contrast, 116, with a minimum of 85 and a maximum of 139. A lower, but not too low, total SO2 count seems potentially important for taste.

3.1.0.9 Density

The density of most wine is close to water, but is affected by alcohol and sugar content. The average density of all wines was 0.994 g / cm^3, with quartiles at 0.9917 and 0.9961.

The quartiles for supreme sample density were at 0.9898 and 0.9906, and the mean was 0.9915 - skewed above the quartiles, most likely by the maximum value at 0.997. Something wonky may be going on here, but as density is affected by alcohol and sugar content, it might make more sense to check those variables’ effect on quality score first.

3.1.0.10 pH

An average pH score for wine is between 3 and 4, and all wines tested fell between a pH of 2.72 and 3.82. The mean for all wines was 3.188.

The mean for the top scorers was 3.308, and all five samples fell above the total mean, ranging between 3.2 and 3.41. This is another case where the difference could be significant, or simply an effect of the small sample size.

3.1.0.11 Sulphates

Potassium sulphate is often added to wine to bolster SO2 levels, again to prevent microbe growth and wine oxidation. All wine samples fell between 0.22 g / dm^3 and 1.08 g / dm^3, with quartiles at 0.41 and 0.55. The mean sulphate content of all samples was 0.49.

The top shelf samples fell between 0.36 and 0.61, with quartiles at 0.42 and 0.48 and a mean of 0.47. This seems sufficiently average, and makes me wonder about the process of SO2 management in wine production - perhaps an industry standard amount determines the levels added, rather than a chemical assay on how much a young wine needs to prevent spoiling? If so, this could account for the widely fluctuating levels of total SO2 in the wines, as compared to the variation in other chemical ranges.

3.1.0.12 Alcohol

Alcohol is arguably the chemical factor with the most agency to improve a wine’s score, depending on how many other samples an expert has judged in one go. The range for all samples was between 8% and 14.2% by volume. The first and third quartiles were at 9.5% and 11.4%, and the mean alcohol content was 10.51% for all samples.

The top scorers all fell between 10.4% and 12.9%, and the average was 12.18% — too much of a difference not to investigate given the physiopsychological effects of alcohol on humans.

3.1.1 Thoughts on univariate profiles of top scorers

Using only the data from the 5 best wines to select which chemical aspects to investigate has some obvious drawbacks, especially the constant potential for artificially high or low means when compared to the total dataset means. Still, I felt it was a good, quick way to identify which factors had the biggest potential to predict whether a wine would score exceptionally well, since I am most interested in finding out which chemical aspects make a wine excel, and which are not correlated with better scores.

Another suspicion I have about this data is that some chemicals are likely more strongly related with a wine not being bad - such as, too much citric acid making a wine taste funny, but below a threshold falls to personal preference or is unreliably detected by the human tongue. This would be another, very interesting, investigation, and I suspect it might be far easier to correlate certain levels of chemicals to drinkable versus undrinkable, rather than trying to correlate certain levels of chemicals with excellence. That is not, however, the analysis I will do here.

The variables I am most interested in, after comparing quartiles and means, are chlorides, total sulfur dioxide, alcohol, and pH. Let’s look a little closer with some histograms.

Chlorides: with a binwidth of 0.001, we see a tall, skinny but (mostly) normally distributed majority, with a bit of a tail starting around 0.08. I’ll be interested to see the distribution of sample scores above and below that number.

Total sulfur dioxide: fairly normally distributed with a binwidth of 5, with just a handful of outliers on the high end.

pH: this plot, which at first suggested a bit of modality, comes out as fairly regular around the mean (3.19) when binwidth is adjusted to 0.01.

Alcohol: this follows a low, slow right skew, with many more wines of around 9.5% alcohol. The most interesting breakdown of this to me so far would be to check the quality rankings between 9 and 9.5, between 9.5 and 11.5, and 12 to 12.5.
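The effect of binwidth on how a histogram reads can be sketched in base R. The data here is simulated (not the wine measurements), and the helper name `binned` is hypothetical; the original plots may well have been drawn with a different plotting package.

```r
set.seed(1)
# Simulated stand-in for the pH column (not the real data).
ph <- rnorm(500, mean = 3.19, sd = 0.15)

# Build equal-width bins of a chosen binwidth that cover the whole range of x,
# and return the histogram object without drawing it.
binned <- function(x, binwidth) {
  breaks <- seq(min(x) - binwidth, max(x) + binwidth, by = binwidth)
  hist(x, breaks = breaks, plot = FALSE)
}

coarse <- binned(ph, binwidth = 0.1)
fine   <- binned(ph, binwidth = 0.01)

length(coarse$counts)  # a few wide bins
length(fine$counts)    # many narrow bins
# plot(fine, main = "pH, binwidth 0.01")  # uncomment to draw the histogram
```

Comparing the two objects side by side is a quick way to check whether an apparent second mode is real or just an artifact of bin edges.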

3.2 Bivariate Analysis

As a quick check, I want to do a correlation matrix.

It seems the strongest correlations in the dataset involve alcohol, density and sugar, as well as total sulfur dioxide. The highest correlation with quality is alcohol, at 0.4. Close behind are chlorides, total sulfur dioxide and density; however, as a liquid’s density is affected by alcohol content, density is also strongly related to alcohol.
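A correlation matrix of this kind can be computed with `cor()`. The sketch below uses simulated columns in which density is constructed to fall as alcohol rises, so the numbers it prints are illustrative only and say nothing about the real dataset.

```r
set.seed(42)
n <- 200

# Simulated stand-in columns; the relationships are built in by hand.
alc  <- runif(n, 8, 14)
dens <- 1.02 - 0.0025 * alc + rnorm(n, sd = 0.001)
qual <- round(pmin(9, pmax(3, 2 + 0.35 * alc + rnorm(n))))

wines <- data.frame(alcohol = alc, density = dens, quality = qual)

# Pairwise Pearson correlations between all columns.
cm <- cor(wines)
round(cm, 2)

# Variables ranked by the strength of their correlation with quality.
sort(abs(cm[, "quality"]), decreasing = TRUE)
```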

3.2.0.1 Chloride levels

I chose a violin plot for this variable after noticing a huge number of outliers and similar medians in a box plot. For the entire set of chloride levels, we see that the maximum density of observations lowers as quality increases, but that on the whole, the range where most observations exist stays fairly consistent. From the earlier histogram, I remember wondering whether I would see a trend in quality scores on either side of 0.08: let’s break those down now.

Two histograms wrapped by new variable chloride.threshold, created using the cut() function, show that neither group clusters around either higher or lower scores; both can be said to focus around the median of quality scores.
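Splitting on the 0.08 threshold with `cut()` might look roughly like this. The chloride and quality values are made up, and the bin labels are my own choice.

```r
# Hypothetical values standing in for the chlorides and quality columns.
wines <- data.frame(
  chlorides = c(0.045, 0.120, 0.050, 0.033, 0.090, 0.044, 0.200, 0.036),
  quality   = c(6L, 5L, 6L, 7L, 5L, 8L, 6L, 6L)
)

# Two bins: (0, 0.08] and (0.08, Inf]. A value of exactly 0 would become NA.
wines$chloride.threshold <- cut(
  wines$chlorides,
  breaks = c(0, 0.08, Inf),
  labels = c("<= 0.08", "> 0.08")
)

# Score counts on either side of the threshold.
table(wines$chloride.threshold, wines$quality)
```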

3.2.0.2 Total SO2 Levels

This plot seems to narrow towards a smaller range as quality goes up, with lower scores having the largest quantiles and error bars, and the error getting smaller and smaller the higher the quality. The trend is especially strong between the biggest three groups of samples (scores 5, 6, and 7), and the medians and error bars do seem to narrow towards a point. This may be worth investigating in a linear model later on.

3.2.0.3 Density

For density, I tested overlaying a violin plot with a box plot, to see how the median changed as well as the distribution of observations around those medians. There does seem to be a trend towards higher qualities being slightly less dense, but the density range of all the over-5 quality samples is somewhat wide. As density is affected by the alcohol and sugar content, it is unclear whether density is the causal variable or simply a side effect of preferred sugar and alcohol levels.

3.2.0.4 Alcohol Content

This is an interesting plot. The mean alcohol content between score 5 and 9 moves strongly upwards. Score 3 and 4 move downwards towards the local minimum at 5 - however, it’s worth noting that score 5 is the only category with more than two outliers, so the calculated mean may be artificially low. This might be worth a closer look.

This plot shows a distinct trend in the means. Despite the density and range of lower quality score alcohol levels, higher quality scores appear to more often have higher alcohol levels (around 12%) and lower quality scores tend to have lower alcohol levels (around 9.8%).
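The per-score means that this trend rests on can be computed with `tapply()`. The values below are hypothetical and deliberately tiny, so only the shape of the calculation carries over.

```r
# Hypothetical alcohol and quality values; not taken from the dataset.
wines <- data.frame(
  alcohol = c(9.0, 10.0, 9.5, 10.5, 11.0, 12.0, 12.5),
  quality = c(5L, 5L, 6L, 6L, 7L, 7L, 9L)
)

# Mean alcohol content within each quality score.
means <- tapply(wines$alcohol, wines$quality, mean)
means

# Group sizes, to judge how much weight each mean deserves.
table(wines$quality)
```

Reading the means next to the group sizes matters here, since the extreme scores have very few samples behind them.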

3.2.1 Thoughts on Bivariate

What I notice across all the graphs is that the means of all chemicals but alcohol are within quantiles of each other, with error bars and outliers concentrating around the quality categories with the highest counts, as one would expect.

So, let’s check out whether certain chemical balances score higher.

4 Multivariate Investigation

This graph shows a strong inverse linear relationship between density and alcohol, which makes sense, but along that trend, we see that higher quality categories tend towards lower density and higher alcohol percentages.

It appears that wines with higher quality scores may have somewhat less sulfur dioxide, although again, whether wines score better due to lower total SO2 or whether wines with more alcohol simply tend to have less SO2 is yet to be seen.

4.1 Bonus Round: Linear model

## [1] "high-quality (5+) model"
## 
## Calls:
## hq1: lm(formula = alcohol ~ quality, data = high.quality)
## hq2: lm(formula = alcohol ~ quality + density, data = high.quality)
## hq3: lm(formula = alcohol ~ quality + density + residual.sugar, data = high.quality)
## 
## =======================================================
##                      hq1         hq2          hq3      
## -------------------------------------------------------
##   (Intercept)      6.265***   296.615***   531.921***  
##                   (0.118)      (3.714)      (5.895)    
##   quality          0.716***     0.352***     0.192***  
##                   (0.020)      (0.014)      (0.012)    
##   density                    -289.920***  -526.695***  
##                                (3.708)      (5.919)    
##   residual.sugar                             0.156***  
##                                             (0.003)    
## -------------------------------------------------------
##   R-squared            0.2         0.7          0.8    
##   adj. R-squared       0.2         0.7          0.8    
##   sigma                1.1         0.7          0.6    
##   F                 1318.6      4571.9       5190.4    
##   p                    0.0         0.0          0.0    
##   Log-likelihood   -7107.3     -5146.1      -4247.3    
##   Deviance          5627.4      2449.2       1672.8    
##   AIC              14220.7     10300.3       8504.7    
##   BIC              14240.1     10326.1       8537.0    
##   N                 4715        4715         4715      
## =======================================================
## 
## Call:
## lm(formula = alcohol ~ quality + density + residual.sugar, data = high.quality)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8541 -0.3864 -0.0611  0.3571 15.6160 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5.319e+02  5.895e+00   90.24   <2e-16 ***
## quality         1.923e-01  1.192e-02   16.14   <2e-16 ***
## density        -5.267e+02  5.919e+00  -88.99   <2e-16 ***
## residual.sugar  1.555e-01  3.326e-03   46.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5959 on 4711 degrees of freedom
## Multiple R-squared:  0.7677, Adjusted R-squared:  0.7676 
## F-statistic:  5190 on 3 and 4711 DF,  p-value: < 2.2e-16

## [1] "low-quality(3&4) model"
## 
## Calls:
## lq1: lm(formula = chlorides ~ quality, data = subset(wineQualityWhites, 
##     as.numeric(quality) <= 4))
## lq2: lm(formula = chlorides ~ quality + density, data = subset(wineQualityWhites, 
##     as.numeric(quality) <= 4))
## lq3: lm(formula = chlorides ~ quality + density + alcohol, data = subset(wineQualityWhites, 
##     as.numeric(quality) <= 4))
## 
## ================================================
##                     lq1       lq2       lq3     
## ------------------------------------------------
##   (Intercept)      0.067*  -3.959***  -1.885    
##                   (0.027)  (0.799)    (1.106)   
##   quality         -0.004   -0.002     -0.004    
##                   (0.007)  (0.006)    (0.006)   
##   density                   4.040***   2.036    
##                            (0.801)    (1.089)   
##   alcohol                             -0.007**  
##                                       (0.003)   
## ------------------------------------------------
##   R-squared          0.0       0.1        0.2   
##   adj. R-squared    -0.0       0.1        0.1   
##   sigma              0.0       0.0        0.0   
##   F                  0.4      12.9       11.3   
##   p                  0.5       0.0        0.0   
##   Log-likelihood   390.8     402.9      406.5   
##   Deviance           0.1       0.1        0.1   
##   AIC             -775.7    -797.8     -803.0   
##   BIC             -766.0    -785.0     -786.9   
##   N                183       183        183     
## ================================================
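The nested models in the tables above follow a build-up pattern that can be sketched with `lm()` and `update()`. Note that, as in the printed calls, alcohol is the response and quality is a predictor. The data is simulated with hand-picked coefficients, so the fitted values will not match the tables, and the side-by-side layout of the original output is not reproduced here.

```r
set.seed(7)
n <- 300

# Simulated stand-in for the high.quality subset; all relationships are invented.
high.quality <- data.frame(
  quality        = sample(5:9, n, replace = TRUE,
                          prob = c(0.30, 0.45, 0.18, 0.06, 0.01)),
  residual.sugar = runif(n, 1, 15)
)
high.quality$alcohol <- 6.3 + 0.7 * high.quality$quality + rnorm(n, sd = 1)
high.quality$density <- 1.005 - 0.0012 * high.quality$alcohol +
  0.0004 * high.quality$residual.sugar + rnorm(n, sd = 0.0005)

# Add one predictor at a time, mirroring hq1, hq2 and hq3.
hq1 <- lm(alcohol ~ quality, data = high.quality)
hq2 <- update(hq1, . ~ . + density)
hq3 <- update(hq2, . ~ . + residual.sugar)

# R-squared cannot decrease as predictors are added to nested models.
r2 <- sapply(list(hq1 = hq1, hq2 = hq2, hq3 = hq3),
             function(m) summary(m)$r.squared)
round(r2, 3)

# F-tests for whether each added predictor improves the fit.
anova(hq1, hq2, hq3)
```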

5 Final Plots and Summary

5.1 Plot One: Chloride Level Violin Plot

5.1.1 Discussion

This plot highlights a trend present in most of the scientifically obtained variables - the wide range of outliers centered on the lower-middle end of the score. Chlorides especially exhibit an almost-flat line for quantiles, mean (red) and median (blue dashed) scores across quality levels.

5.2 Plot Two: Drunk in Love (With R)

5.2.1 Discussion

This chart explored how a sample’s alcohol percentage related to its quality. With many more samples in the lower spectrums again, it was exciting to see a linear trend emerge. With fewer samples in the higher scores, we see the error increase along with the quality score, but our linear models, which regress alcohol on quality, density and residual sugar, support a relationship between alcohol and a wine’s score.

5.3 Plot 3: Lining up the variables

5.3.1 Discussion

This chart explores in more detail the bivariate trends at each quality level. While both Density and Total Sulfur Dioxide display more variety at score 5 and 6, a non-flat linear trend persists in the Density X Alcohol breakdown, whereas a non-flat linear trend does not emerge in higher quality levels for TSO2. Since Alcohol exhibited the strongest trend, you’d expect that if a particular level of TSO2 was an indicator of quality, you’d find that level paired with any one quality score more often than the others, and that is the opposite of what we see.

6 Reflection

This dataset suggests, after scrutiny, that certain factors (alcohol, residual sugar, and density) may be more strongly correlated with a sample’s score than others (chlorides, sulfur dioxide, pH). With so few samples of exceptional wines, it’s hard to classify what makes a great wine as opposed to a mediocre or bad wine; however, my observation is that these three more-predictive variables could be classified as sort of “macro” flavors - flavors all humans can probably detect easily and accurately.

The dataset did not provide any ranges for humans’ ability to detect or distinguish between concentrations of any of the chemicals, let alone those that the study mentioned as “important” to a wine’s flavor. Smell plays a huge role in how we perceive taste, as well, and this dataset didn’t include any variables in that category. Since it’s highly possible that humans have perception thresholds for some of these chemicals that are much higher than the reported values, this dataset only represents a tiny fraction of the variables involved with tasting wine.

My initial reaction to this dataset was to worry that I would not uncover any strong relationships. As initial probes with histograms seemed to confirm this fear, I really struggled with how to select trends and relationships worth investigating, and debated for a time simply performing repetitive, standard graphs of all the variables to brute-force any possible relationships out.

As the analysis progressed and a few clear front-runners came out, I next grappled with the opposite issue - suddenly, only a few variables seemed worth looking at! All my graphs were, for a period, different permutations of Alcohol X Quality, trying to discover the best way to capture the relationship. Eventually I decided that I wanted to include some of the variables with uninteresting graphs - see those flat lines of total sulfur dioxide and messy chlorides - because I felt that they helped to justify making such a big deal out of so slight a relationship as that between alcohol and quality. It told the story I was seeing, that only slight trends were to be found, with more context than the alcohol X density X quality plots alone.

Another challenge was how to describe the alcohol X density X sugar relationship as it related to quality. I had in my mind that a three-dimensional histogram, colored by quality perhaps, or with a synthesized line showing some sort of quality-predictive plane, would be amazing. I tried plot3D, OceanView, and the generic R 3D modeling, but was unsatisfied with all of my experiments. That will be the next update I perform on this dataset. A larger dataset including the sample’s country of origin would be an exciting version of this - a multidimensional heat map linking the number of high-quality wines to the typical density profiles of the wines they produce.